options(repos = c(CRAN = "https://cran.rstudio.com/"))
HR ANALYTICS EMPLOYEE ATTRITION AND PERFORMANCE
BCon 147: special topics
1 Project overview
In this project, we will explore employee attrition and performance using the HR Analytics Employee Attrition & Performance dataset. The primary goal is to develop insights into the factors that contribute to employee attrition. By analyzing a range of factors, including demographic data, job satisfaction, work-life balance, and job role, we aim to help businesses identify key areas where they can improve employee retention.
2 Scenario
Imagine you are working as a data analyst for a mid-sized company that is experiencing high employee turnover, especially among high-performing employees. The company has been facing increased costs related to hiring and training new employees, and management is concerned about the negative impact on productivity and morale. The human resources (HR) team has collected historical employee data and now looks to you for actionable insights. They want to understand why employees are leaving and how to retain talent effectively.
Your task is to analyze the dataset and provide insights that will help HR prioritize retention strategies. These strategies could include interventions like revising compensation policies, improving job satisfaction, or focusing on work-life balance initiatives. The success of your analysis could lead to significant cost savings for the company and an increase in employee engagement and performance.
3 Understanding data source
The dataset used for this project provides information about employee demographics, performance metrics, and various satisfaction ratings. The dataset is particularly useful for exploring how factors such as job satisfaction, work-life balance, and training opportunities influence employee performance and attrition.
This dataset is well-suited for conducting in-depth analysis of employee performance and retention, enabling us to build predictive models that identify the key drivers of employee attrition. Additionally, we can assess the impact of various organizational factors, such as training and work-life balance, on both performance and retention outcomes.
## datatable function from DT package create an HTML widget display of the dataset
## install DT package if the package is not yet available in your R environment
readxl::read_excel("dataset/dataset-variable-description.xlsx") |>
  DT::datatable()
4 Data wrangling and management
Libraries
Before we start working on the dataset, we need to load the libraries that will be used for data wrangling, analysis, and visualization. Make sure to load the following libraries here. To install a package, use the install.packages function. More packages will be needed later in this project, so install them as needed and load them here.
# load all your libraries here
install.packages(c("dplyr", "ggplot2", "DT", "janitor", "GGally", "sjPlot", "report", "ggstatsplot"))
library(dplyr)
library(ggplot2)
library(DT)
library(janitor)
library(GGally)
library(sjPlot)
library(report)
library(ggstatsplot)
4.1 Data importation
Import the two datasets, Employee.csv and PerformanceRating.csv. Save Employee.csv as employee_dta and PerformanceRating.csv as perf_rating_dta.
Merge the two datasets using the left_join function from dplyr. Use the EmployeeID variable as the variable to join by. You may read more about left_join in the dplyr documentation.
Save the merged dataset as hr_perf_dta and display the dataset using the datatable function from the DT package.
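Before merging, it helps to see what left_join does when one table has several rows per key. The sketch below uses small hypothetical tables (the names employees and reviews and their values are invented for illustration, not taken from the dataset). Every row of the left table is kept, rows with several matches are repeated, and rows with no match get NA in the joined columns. This appears to be why later tables built from hr_perf_dta count 6,899 rows while the employee-level tables count 1,470 employees.

```r
library(dplyr)

# Hypothetical toy data: three employees, three performance reviews
employees <- data.frame(
  EmployeeID = c("A1", "B2", "C3"),
  Attrition  = c("No", "Yes", "No")
)
reviews <- data.frame(
  EmployeeID    = c("A1", "A1", "B2"),
  ManagerRating = c(3, 4, 2)
)

# Keep every employee; repeat employees that have more than one review
merged <- left_join(employees, reviews, by = "EmployeeID")

nrow(employees)                   # number of employees
nrow(merged)                      # number of rows after the join
sum(is.na(merged$ManagerRating))  # employees without any review
```

If an analysis should count each employee once, compute it on the employee-level table or deduplicate the merged data first, for example with distinct(EmployeeID, .keep_all = TRUE).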
## import the two data here
employee_dta <- read.csv("dataset/Employee.csv")
perf_rating_dta <- read.csv("dataset/PerformanceRating.csv")
## merge employee_dta and perf_rating_dta using left_join function.
## save the merged dataset as hr_perf_dta
hr_perf_dta <- left_join(employee_dta, perf_rating_dta, by = "EmployeeID")
## Use the datatable from DT package to display the merged dataset
DT::datatable(hr_perf_dta)
4.2 Data management
Using the clean_names function from the janitor package, standardize the variable names by using the recommended naming of variables.
Save the renamed variables as hr_perf_dta to update the dataset.
## clean names using the janitor packages and save as hr_perf_dta
hr_perf_dta <- hr_perf_dta %>% clean_names()
## display the renamed hr_perf_dta using datatable function
datatable(hr_perf_dta)
Create a new variable cat_education wherein education is 1 = No formal education; 2 = High school; 3 = Bachelor; 4 = Masters; 5 = Doctorate. Use the case_when function to accomplish this task.
Similarly, create new variables cat_envi_sat, cat_job_sat, and cat_relation_sat for environment_satisfaction, job_satisfaction, and relationship_satisfaction, respectively. Re-code the values accordingly as 1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; and 5 = Very satisfied.
Create new variables cat_work_life_balance, cat_self_rating, and cat_manager_rating for work_life_balance, self_rating, and manager_rating, respectively. Re-code accordingly as 1 = Unacceptable; 2 = Needs improvement; 3 = Meets expectation; 4 = Exceeds expectation; and 5 = Above and beyond.
Create a new variable bi_attrition by transforming the attrition variable into a numeric variable. Re-code accordingly as No = 0 and Yes = 1.
Save all the changes in hr_perf_dta. Note that saving the changes with the same name will update the dataset with the new variables created.
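The exercise asks for case_when, which returns plain character labels. As a side note, the same re-coding can be sketched with base R's factor function, which also records the order of the scale so that tables and plots list the levels from lowest to highest instead of alphabetically. The vector below is hypothetical example data, and sat_labels is a helper name introduced only for this illustration.

```r
# Hypothetical ratings on the 1-5 satisfaction scale, with one missing value
job_satisfaction <- c(4, 1, 5, 3, NA, 2)

# Labels in scale order, as listed in the task description
sat_labels <- c("Very dissatisfied", "Dissatisfied", "Neutral",
                "Satisfied", "Very satisfied")

# Map codes 1..5 to labels and keep the ordering information
cat_job_sat <- factor(job_satisfaction, levels = 1:5,
                      labels = sat_labels, ordered = TRUE)

levels(cat_job_sat)                  # labels stay in scale order
table(cat_job_sat, useNA = "ifany")  # counts per label, NA kept visible
```

Values outside 1 to 5 become NA, which mirrors the TRUE ~ NA fallback used with case_when.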
## create cat_education
hr_perf_dta <- hr_perf_dta %>%
  mutate(
    cat_education = case_when(
      education == 1 ~ "No formal education",
      education == 2 ~ "High school",
      education == 3 ~ "Bachelor",
      education == 4 ~ "Masters",
      education == 5 ~ "Doctorate",
      TRUE ~ NA_character_  # handle any unexpected values
    )
  )
## create cat_envi_sat, cat_job_sat, and cat_relation_sat
## the source columns hold the numeric codes 1-5, so map codes to labels
hr_perf_dta <- hr_perf_dta %>%
  mutate(
    cat_envi_sat = case_when(
      environment_satisfaction == 1 ~ "Very dissatisfied",
      environment_satisfaction == 2 ~ "Dissatisfied",
      environment_satisfaction == 3 ~ "Neutral",
      environment_satisfaction == 4 ~ "Satisfied",
      environment_satisfaction == 5 ~ "Very satisfied",
      TRUE ~ NA_character_  # handle any unexpected values
    ),
    cat_job_sat = case_when(
      job_satisfaction == 1 ~ "Very dissatisfied",
      job_satisfaction == 2 ~ "Dissatisfied",
      job_satisfaction == 3 ~ "Neutral",
      job_satisfaction == 4 ~ "Satisfied",
      job_satisfaction == 5 ~ "Very satisfied",
      TRUE ~ NA_character_
    ),
    cat_relation_sat = case_when(
      relationship_satisfaction == 1 ~ "Very dissatisfied",
      relationship_satisfaction == 2 ~ "Dissatisfied",
      relationship_satisfaction == 3 ~ "Neutral",
      relationship_satisfaction == 4 ~ "Satisfied",
      relationship_satisfaction == 5 ~ "Very satisfied",
      TRUE ~ NA_character_
    )
  )
## create cat_work_life_balance, cat_self_rating, and cat_manager_rating
## the source columns hold the numeric codes 1-5, so map codes to labels
hr_perf_dta <- hr_perf_dta %>%
  mutate(
    cat_work_life_balance = case_when(
      work_life_balance == 1 ~ "Unacceptable",
      work_life_balance == 2 ~ "Needs improvement",
      work_life_balance == 3 ~ "Meets expectation",
      work_life_balance == 4 ~ "Exceeds expectation",
      work_life_balance == 5 ~ "Above and beyond",
      TRUE ~ NA_character_  # handle any unexpected values
    ),
    cat_self_rating = case_when(
      self_rating == 1 ~ "Unacceptable",
      self_rating == 2 ~ "Needs improvement",
      self_rating == 3 ~ "Meets expectation",
      self_rating == 4 ~ "Exceeds expectation",
      self_rating == 5 ~ "Above and beyond",
      TRUE ~ NA_character_
    ),
    cat_manager_rating = case_when(
      manager_rating == 1 ~ "Unacceptable",
      manager_rating == 2 ~ "Needs improvement",
      manager_rating == 3 ~ "Meets expectation",
      manager_rating == 4 ~ "Exceeds expectation",
      manager_rating == 5 ~ "Above and beyond",
      TRUE ~ NA_character_
    )
  )
## create bi_attrition
hr_perf_dta <- hr_perf_dta %>%
mutate(
bi_attrition = case_when(
attrition == "Yes" ~ 1, # Assuming 'attrition' column has "Yes" for employees who left
attrition == "No" ~ 0, # Assuming 'attrition' column has "No" for employees still employed
TRUE ~ NA_real_ # Handle any unexpected values
)
)
## print the updated hr_perf_dta using datatable function
datatable(hr_perf_dta)
5 Exploratory data analysis
5.1 Descriptive statistics of employee attrition
Select the variables attrition, job_role, department, age, salary, job_satisfaction, and work_life_balance. Save as attrition_key_var_dta.
Compute and plot the attrition rate across job_role, department, age, salary, job_satisfaction, and work_life_balance. To compute the attrition rate, group the dataset by job role. Afterward, you can use the count function to get the frequency of attrition for each job role and then divide it by the total number of observations. Save the computation as pct_attrition. Do not forget to ungroup before storing the output. Store the output as attrition_rate_job_role.
The plot for the attrition rate across job_role has been done for you! Study each line of code. You have the freedom to customize your plot accordingly. Show your creativity!
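The count-then-divide recipe described above can be sketched as follows. The data frame toy is hypothetical and exists only to make the arithmetic easy to check by hand; the column names follow the cleaned names used in this project.

```r
library(dplyr)

# Hypothetical toy data: four recruiters (two left), two managers (none left)
toy <- data.frame(
  job_role  = c("Recruiter", "Recruiter", "Recruiter", "Recruiter",
                "Manager", "Manager"),
  attrition = c("Yes", "No", "No", "Yes", "No", "No")
)

attrition_rate_toy <- toy %>%
  count(job_role, attrition, name = "n") %>%    # frequency per role and status
  group_by(job_role) %>%
  mutate(pct_attrition = n / sum(n) * 100) %>%  # share within each role
  ungroup() %>%                                 # ungroup before storing
  filter(attrition == "Yes")                    # keep the attrition rows only

attrition_rate_toy
```

One design point to be aware of: with this approach a role with no leavers has no "Yes" row and drops out of the result. Summarising with sum(attrition == "Yes") / n() inside summarise keeps such roles with a rate of 0.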
## selecting attrition key variables and save as `attrition_key_var_dta`
attrition_key_var_dta <- hr_perf_dta %>%
select(attrition, job_role, department, age, salary, job_satisfaction, work_life_balance)
## compute the attrition rate across job_role and save as attrition_rate_job_role
attrition_rate_job_role <- employee_dta %>%
group_by(JobRole) %>%
summarise(
total_employees = n(),
total_attrition = sum(Attrition == "Yes", na.rm = TRUE)
) %>%
mutate(pct_attrition = total_attrition / total_employees * 100) %>%
ungroup()
## print attrition_rate_job_role
print(attrition_rate_job_role)
# A tibble: 13 × 4
   JobRole                   total_employees total_attrition pct_attrition
   <chr>                               <int>           <int>         <dbl>
 1 Analytics Manager                      52               3          5.77
 2 Data Scientist                        261              62         23.8
 3 Engineering Manager                    75               2          2.67
 4 HR Business Partner                     7               0          0
 5 HR Executive                           28               3         10.7
 6 HR Manager                              4               0          0
 7 Machine Learning Engineer             146              10          6.85
 8 Manager                                37               2          5.41
 9 Recruiter                              24               9         37.5
10 Sales Executive                       327              57         17.4
11 Sales Representative                   83              33         39.8
12 Senior Software Engineer              132               9          6.82
13 Software Engineer                     294              47         16.0
# Attrition Rate by Job Role
ggplot(attrition_rate_job_role, aes(x = reorder(JobRole, -pct_attrition), y = pct_attrition)) +
geom_bar(stat = "identity", fill = "red") +
labs(title = "Attrition Rate by Job Role", x = "Job Role", y = "Attrition Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
## compute the attrition rate across department and save as attrition_rate_department
attrition_rate_department <- employee_dta %>%
group_by(Department) %>%
summarise(
total_employees = n(),
total_attrition = sum(Attrition == "Yes", na.rm = TRUE)
) %>%
mutate(pct_attrition = total_attrition / total_employees * 100) %>%
ungroup()
## print attrition_rate_department
print(attrition_rate_department)
# A tibble: 3 × 4
  Department      total_employees total_attrition pct_attrition
  <chr>                     <int>           <int>         <dbl>
1 Human Resources              63              12          19.0
2 Sales                       446              92          20.6
3 Technology                  961             133          13.8
# Attrition Rate by Department
ggplot(attrition_rate_department, aes(x = reorder(Department, -pct_attrition), y = pct_attrition)) +
geom_bar(stat = "identity", fill = "lightgreen") +
labs(title = "Attrition Rate by Department", x = "Department", y = "Attrition Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))
## compute the attrition rate across age and save as attrition_rate_age
attrition_rate_age <- attrition_key_var_dta %>%
mutate(age_group = cut(age, breaks = c(20, 30, 40, 50, 60, 70),
labels = c("20-29", "30-39", "40-49", "50-59", "60-69"),
right = FALSE)) %>%
group_by(age_group) %>%
summarise(
total_employees = n(),
total_attrition = sum(attrition == "Yes", na.rm = TRUE), # Adjust to the correct column name
pct_attrition = (total_attrition / total_employees) * 100
) %>%
ungroup()
## print attrition_rate_age
print(attrition_rate_age)
# A tibble: 5 × 4
  age_group total_employees total_attrition pct_attrition
  <fct>               <int>           <int>         <dbl>
1 20-29                3777            1752          46.4
2 30-39                1619             259          16.0
3 40-49                1318             142          10.8
4 50-59                   8               0           0
5 <NA>                  177             108          61.0
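The <NA> row in the table above is what cut produces for values that fall outside the supplied breaks. With breaks starting at 20 and right = FALSE, any age below 20 is assigned NA. Assuming the dataset contains employees younger than 20 (an assumption, since the raw ages are not shown here), the sketch below illustrates the behavior on a hypothetical vector and shows one way to capture those ages by adding a lower break.

```r
# Hypothetical ages, including two values below the first break
ages <- c(18, 19, 20, 29, 30, 59)

# Same breaks as in the chunk above: intervals are [20, 30), [30, 40), ...
age_group <- cut(ages,
                 breaks = c(20, 30, 40, 50, 60, 70),
                 labels = c("20-29", "30-39", "40-49", "50-59", "60-69"),
                 right = FALSE)
age_group  # ages 18 and 19 fall outside the breaks and become NA

# Adding a lower break at 18 gives those ages their own group
age_group_fixed <- cut(ages,
                       breaks = c(18, 20, 30, 40, 50, 60, 70),
                       labels = c("18-19", "20-29", "30-39", "40-49",
                                  "50-59", "60-69"),
                       right = FALSE)
age_group_fixed
```

Checking range(hr_perf_dta$age, na.rm = TRUE) before choosing breaks is a quick way to confirm that every observed age is covered.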
# Attrition Rate by Age
ggplot(attrition_rate_age, aes(x = age_group, y = pct_attrition)) +
geom_bar(stat = "identity", fill = "skyblue", color = "yellow") +
geom_text(aes(label = round(pct_attrition, 1)), vjust = -0.5, size = 3.5) +
labs(title = "Attrition Rate by Age Group",
x = "Age Group",
y = "Attrition Rate (%)") +
ylim(0, 80) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", color = "purple"),
plot.title = element_text(hjust = 0.5, face = "bold", color = "darkred"),
plot.margin = unit(c(1, 1, 1, 1.5), "cm"))
# Create salary bins and compute attrition rate
attrition_rate_salary <- attrition_key_var_dta %>%
mutate(salary_group = cut(salary,
breaks = c(0, 30000, 50000, 70000, 90000, 110000, Inf),
labels = c("0-30k", "30k-50k", "50k-70k", "70k-90k", "90k-110k", "110k+"),
right = FALSE)) %>%
group_by(salary_group) %>%
summarise(
total_employees = n(),
total_attrition = sum(attrition == "Yes", na.rm = TRUE), # Adjust to the correct column name
pct_attrition = (total_attrition / total_employees) * 100
) %>%
ungroup()
# View the computed attrition rates
print(attrition_rate_salary)
# A tibble: 6 × 4
  salary_group total_employees total_attrition pct_attrition
  <fct>                  <int>           <int>         <dbl>
1 0-30k                    673             425          63.2
2 30k-50k                 1469             686          46.7
3 50k-70k                 1095             384          35.1
4 70k-90k                  770             211          27.4
5 90k-110k                 665             116          17.4
6 110k+                   2227             439          19.7
# attrition rate by salary group
ggplot(attrition_rate_salary, aes(x = salary_group, y = pct_attrition)) +
geom_col(fill = "steelblue") + # Set the fill color to steelblue
labs(title = "Attrition Rate by Salary Group", x = "Salary Group", y = "Attrition Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## compute the attrition rate across job_satisfaction and save as attrition_rate_job_satisfaction
attrition_rate_job_satisfaction <- attrition_key_var_dta %>%
group_by(job_satisfaction) %>%
summarise(
total_employees = n(),
total_attrition = sum(attrition == "Yes", na.rm = TRUE), # Adjust to the correct column name
pct_attrition = (total_attrition / total_employees) * 100
) %>%
ungroup()
# View the computed attrition rates
print(attrition_rate_job_satisfaction)
# A tibble: 6 × 4
  job_satisfaction total_employees total_attrition pct_attrition
             <int>           <int>           <int>         <dbl>
1                1             130              36          27.7
2                2            1674             549          32.8
3                3            1651             568          34.4
4                4            1685             573          34.0
5                5            1569             535          34.1
6               NA             190               0           0
# attrition rate by job satisfaction
ggplot(attrition_rate_job_satisfaction, aes(x = reorder(job_satisfaction, pct_attrition), y = pct_attrition)) +
geom_bar(stat = "identity", fill = "lightpink", color = "darkred") +
geom_text(aes(label = paste0(round(pct_attrition, 1), "%")),
vjust = -0.5, size = 4) +
labs(title = "Attrition Rate by Job Satisfaction",
x = "Job Satisfaction Level",
y = "Attrition Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Compute attrition rate by work-life balance
attrition_rate_work_life <- attrition_key_var_dta %>%
group_by(work_life_balance) %>% # Use the correct variable for work-life balance
summarise(
total_employees = n(),
total_attrition = sum(attrition == "Yes", na.rm = TRUE), # Adjust to your attrition column name
pct_attrition = (total_attrition / total_employees) * 100
) %>%
ungroup()
# View the computed attrition rates
print(attrition_rate_work_life)
# A tibble: 6 × 4
  work_life_balance total_employees total_attrition pct_attrition
              <int>           <int>           <int>         <dbl>
1                 1             121              37          30.6
2                 2            1702             568          33.4
3                 3            1670             580          34.7
4                 4            1706             560          32.8
5                 5            1510             516          34.2
6                NA             190               0           0
# attrition rate by work-life balance
ggplot(attrition_rate_work_life, aes(x = reorder(work_life_balance, pct_attrition), y = pct_attrition)) +
geom_bar(stat = "identity", fill = "lightblue", color = "darkblue") +
geom_text(aes(label = paste0(round(pct_attrition, 1), "%")),
vjust = -0.5, size = 4) + # Add percentage labels above bars
labs(title = "Attrition Rate by Work-Life Balance",
x = "Work-Life Balance Level",
y = "Attrition Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Angle x-axis text for readability
5.2 Identifying attrition key drivers using correlation analysis
Conduct a correlation analysis of key variables: bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Use the cor() function to run the correlation analysis. Remove missing values using na.omit() before running the correlation analysis. Save the output in hr_corr.
Use a correlation matrix or heatmap to visualize the relationship between these variables and attrition. You can use the GGally package and its ggcorr function to visualize the correlation heatmap. You may explore the ggcorr documentation for more information.
Discuss which factors seem most correlated with attrition and what that suggests about why employees are leaving.
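To support the discussion, it can help to pull the attrition row out of the correlation matrix and rank the other variables by the absolute size of their correlation. The following is a minimal sketch on simulated data. The data frame toy, its values, and the name attr_corr are hypothetical, so the printed numbers say nothing about the real dataset.

```r
set.seed(1)
n <- 200

# Simulated data: attrition is made more likely for short tenure
toy <- data.frame(
  years_at_company = rpois(n, 4),
  salary           = round(rnorm(n, mean = 90000, sd = 20000))
)
toy$bi_attrition <- rbinom(n, 1, plogis(1 - 0.5 * toy$years_at_company))
toy$salary[c(3, 10)] <- NA  # introduce two missing values

# Drop incomplete rows, then compute Pearson correlations
toy_corr <- cor(na.omit(toy))

# Correlations with attrition only, ranked by absolute size
attr_corr <- toy_corr["bi_attrition", colnames(toy_corr) != "bi_attrition"]
attr_corr[order(abs(attr_corr), decreasing = TRUE)]
```

Calling cor(toy, use = "complete.obs") gives the same matrix without a separate na.omit step.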
## conduct correlation of key variables.
hr_key_vars <- hr_perf_dta %>%
select(bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, work_life_balance)
hr_key_vars_clean <- na.omit(hr_key_vars)
hr_corr <- cor(hr_key_vars_clean)
## print hr_corr
datatable(hr_corr)
## install GGally package and use ggcorr function to visualize the correlation
library(GGally)
hr_key_vars <- hr_perf_dta %>%
select(bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, work_life_balance)
hr_key_vars_clean <- na.omit(hr_key_vars)
# Create a correlation plot with a custom color palette
ggcorr(hr_key_vars_clean,
method = c("everything", "pearson"),
palette = colorRampPalette(c("yellow", "red", "green")),
label = TRUE,
label_round = 1,
label_size = 2,
hjust = 0.75,
size = 3)
Provide your discussion here.
5.3 Predictive modeling for attrition
Create a logistic regression model to predict employee attrition using the following variables: salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Save the model as hr_attrition_glm_model. Print the summary of the model using the summary function.
Install the sjPlot package and use the tab_model function to display the summary of the model. You may read the sjPlot documentation on how to customize your model summary.
Also, use the plot_model function to visualize the model coefficients. You may read the sjPlot documentation on how to customize your model visualization.
Discuss the results of the logistic regression model and what they suggest about the factors that contribute to employee attrition.
## run a logistic regression model to predict employee attrition
## save the model as hr_attrition_glm_model
hr_attrition_glm_model <- glm(
bi_attrition ~ salary + years_at_company + manager_rating + work_life_balance,
data = hr_key_vars,
family = binomial() )
## print the summary of the model using the summary function
summary(hr_attrition_glm_model)
Call:
glm(formula = bi_attrition ~ salary + years_at_company + manager_rating +
work_life_balance, family = binomial(), data = hr_key_vars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.685e+00 1.905e-01 14.093 <2e-16 ***
salary -3.630e-06 4.085e-07 -8.888 <2e-16 ***
years_at_company -6.334e-01 1.476e-02 -42.915 <2e-16 ***
manager_rating 4.611e-03 3.808e-02 0.121 0.904
work_life_balance 2.757e-02 3.194e-02 0.863 0.388
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8574.5 on 6708 degrees of freedom
Residual deviance: 4782.8 on 6704 degrees of freedom
(190 observations deleted due to missingness)
AIC: 4792.8
Number of Fisher Scoring iterations: 5
## install sjPlot package and use tab_model function to display the summary of the model
library(sjPlot)
tab_model(hr_attrition_glm_model)
| | bi attrition | | |
| Predictors | Odds Ratios | CI | p |
| (Intercept) | 14.66 | 10.12 – 21.35 | <0.001 |
| salary | 1.00 | 1.00 – 1.00 | <0.001 |
| years at company | 0.53 | 0.52 – 0.55 | <0.001 |
| manager rating | 1.00 | 0.93 – 1.08 | 0.904 |
| work life balance | 1.03 | 0.97 – 1.09 | 0.388 |
| Observations | 6709 | | |
| R2 Tjur | 0.501 | | |
## use plot_model function to visualize the model coefficients
plot_model(hr_attrition_glm_model, type = "est", show.values = TRUE, value.offset = .3)
The logistic regression shows that salary and years at company are the only statistically significant predictors in the model (both p < 2e-16). Years at company has the strongest effect: each additional year is associated with roughly 47% lower odds of leaving (odds ratio 0.53), and its correlation with attrition is strongly negative (-0.7). This suggests that employees are most vulnerable to leaving during their early years, making this period critical for retention efforts. The salary coefficient is also negative; its odds ratio rounds to 1.00 only because it is expressed per single unit of salary. Manager rating (p = 0.904) and work-life balance (p = 0.388) are not significant predictors once salary and tenure are in the model. Based on these findings, HR interventions should prioritize competitive salary structures and enhanced support during employees’ early years with the company through mentorship programs and clear career progression paths. Because longer-tenured employees are significantly less likely to leave, investing in long-term employee development and recognition programs is also worthwhile. Focusing resources on these two areas is where retention efforts are most likely to reduce turnover.
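The odds ratios reported by tab_model are the exponentiated coefficients from the summary output. The short sketch below copies three coefficients from that output and converts them by hand. It also rescales the salary effect to a 10,000-unit change, which is an illustrative choice of scale rather than something the original analysis does.

```r
# Coefficients copied from the summary() output (log-odds scale)
coefs <- c(intercept        =  2.685,
           salary           = -3.630e-06,
           years_at_company = -6.334e-01)

# Exponentiate to move from log-odds to odds ratios
odds_ratios <- exp(coefs)
round(odds_ratios, 2)

# Salary effect per 10,000 units instead of per single unit
or_salary_10k <- exp(coefs[["salary"]] * 10000)
round(or_salary_10k, 3)

# Percent change in the odds of leaving for one more year at the company
pct_change_year <- (odds_ratios[["years_at_company"]] - 1) * 100
round(pct_change_year, 1)
```

Refitting the model with salary divided by 10,000 would report the rescaled odds ratio directly and make the coefficient plot easier to read.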
5.4 Analysis of compensation and turnover
Compare the average monthly income of employees who left the company (bi_attrition = 1) and those who stayed (bi_attrition = 0). Use the t.test function to conduct a t-test and determine if there is a significant difference in average monthly income between the two groups. Save the results in a variable called attrition_ttest_results.
Install the report package and use the report function to generate a report of the t-test results.
Install the ggstatsplot package and use the ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed. Make sure to map the bi_attrition variable to the x argument and the salary variable to the y argument.
Visualize the salary variable for employees who left and those who stayed using geom_histogram with geom_freqpoly. Make sure to facet the plot by the bi_attrition variable and apply alpha on the histogram plot.
Provide recommendations on whether revising compensation policies could be an effective retention strategy.
## compare the average monthly income of employees who left and those who stayed
attrition_ttest_results <- t.test(salary ~ bi_attrition, data = hr_perf_dta)
## print the results of the t-test
print(attrition_ttest_results)
Welch Two Sample t-test
data: salary by bi_attrition
t = 18.869, df = 5524.2, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
38577.82 47523.18
sample estimates:
mean in group 0 mean in group 1
125007.26 81956.76
## install the report package and use the report function to generate a report of the t-test results
library(report)
attrition_ttest_results <- t.test(salary ~ bi_attrition, data = hr_perf_dta)
report_ttest <- report(attrition_ttest_results)
# Print the report
report_ttest
Effect sizes were labelled following Cohen's (1988) recommendations.
The Welch Two Sample t-test testing the difference of salary by bi_attrition
(mean in group 0 = 1.25e+05, mean in group 1 = 81956.76) suggests that the
effect is positive, statistically significant, and medium (difference =
43050.50, 95% CI [38577.82, 47523.18], t(5524.24) = 18.87, p < .001; Cohen's d
= 0.51, 95% CI [0.45, 0.56])
# install ggstatsplot package and use ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed
library(ggstatsplot)
ggbetweenstats(
data = hr_perf_dta,
x = bi_attrition,
y = salary,
xlab = "Attrition (0 = Stayed, 1 = Left)",
ylab = "Monthly Income",
title = "Distribution of Monthly Income for Employees Who Left vs Stayed",
ggtheme = ggplot2::theme_bw(7)
)
# create histogram and frequency polygon of salary for employees who left and those who stayed
library(ggplot2)
library(scales)
ggplot(hr_perf_dta, aes(x = salary, fill = factor(bi_attrition))) +
geom_histogram(alpha = 0.6, position = "identity", bins = 12) + # Use bins for better control
scale_fill_manual(values = c("red", "yellow"), labels = c("Stayed", "Left")) +
labs(title = "Salary Distribution for Employees Who Stayed vs. Left",
x = "Salary",
y = "Count",
fill = "Attrition Status") +
scale_x_continuous(limits = c(0, 600000),
breaks = seq(0, 600000, by = 50000),
labels = comma) + # Format x-axis labels with commas
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Slant x-axis labels
The statistical analysis reveals a sizable gap in compensation between employees who remained with the company and those who departed: mean salaries differ by 43,050 (125,007 vs. 81,957). The Welch Two Sample t-test (p < 2.2e-16) indicates that this difference is very unlikely to be due to chance, and the effect size is medium (Cohen’s d = 0.51). Visualizations show departing employees clustering in lower salary ranges (50,000-100,000), while those who stayed are spread more broadly across higher salary ranges. However, the overlapping distributions suggest that salary, while important, is not the sole determining factor in retention decisions. The data indicate that the company’s current compensation structure may be contributing to turnover, particularly among employees in lower salary brackets. Based on these findings, a strategic revision of compensation policies, focusing on competitive market-rate adjustments for lower-paid employees and clearer salary progression paths, could serve as an effective retention strategy.
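To make the two reported quantities concrete, the sketch below runs a Welch t-test on simulated salaries and then computes Cohen's d by hand using the pooled standard deviation. The data are simulated and every object name is hypothetical. The pooled-SD formula is one common definition of d, so the value the report package prints for a Welch test may be computed slightly differently.

```r
set.seed(42)

# Simulated salaries for two groups of unequal size and spread
stayed <- rnorm(300, mean = 120000, sd = 80000)
left   <- rnorm(150, mean = 85000,  sd = 60000)
toy <- data.frame(
  salary       = c(stayed, left),
  bi_attrition = rep(c(0, 1), times = c(300, 150))
)

# t.test uses the Welch version (unequal variances) by default
res <- t.test(salary ~ bi_attrition, data = toy)
res$method

# Difference in group means: group 0 (stayed) minus group 1 (left)
mean_diff <- unname(res$estimate[1] - res$estimate[2])

# Cohen's d with a pooled standard deviation
n1 <- length(stayed)
n2 <- length(left)
pooled_sd <- sqrt(((n1 - 1) * var(stayed) + (n2 - 1) * var(left)) /
                    (n1 + n2 - 2))
cohens_d <- mean_diff / pooled_sd
cohens_d
```

Reading d alongside the p-value matters here because, with several thousand rows, even small differences can reach statistical significance.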
5.5 Employee satisfaction and performance analysis
Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed. Use the group_by and count functions to calculate the average performance ratings for each group.
Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot. Use the ggplot function to create the plot and map the SelfRating variable to the x argument and the bi_attrition variable to the fill argument.
Similarly, visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot. Make sure to map the ManagerRating variable to the x argument and the bi_attrition variable to the fill argument.
Create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition. Use the geom_boxplot function to create the plot and map the salary variable to the x argument, the job_satisfaction variable to the y argument, and the bi_attrition variable to the fill argument. You need to transform the job_satisfaction and bi_attrition variables into factors before creating the plot or within the ggplot function.
Discuss the results of the analysis and provide recommendations for HR interventions based on the findings.
# Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed.
library(dplyr)
avg_ratings <- hr_perf_dta %>%
group_by(bi_attrition) %>%
summarise(
avg_manager_rating = mean(manager_rating, na.rm = TRUE),
avg_self_rating = mean(self_rating, na.rm = TRUE),
count_employees = n()
)
# View the average ratings
print(avg_ratings)
# A tibble: 2 × 4
  bi_attrition avg_manager_rating avg_self_rating count_employees
         <dbl>              <dbl>           <dbl>           <int>
1            0               3.48            3.98            4638
2            1               3.46            3.99            2261
# Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot.
self_rating_dist <- hr_perf_dta %>%
group_by(bi_attrition, self_rating) %>%
summarise(count = n(), .groups = 'drop')
#bar plot
ggplot(self_rating_dist, aes(x = factor(self_rating), y = count, fill = factor(bi_attrition))) +
geom_bar(stat = "identity", position = "dodge") + # Use position = "dodge" for side-by-side bars
scale_fill_manual(values = c("green", "yellow"), labels = c("Stayed", "Left")) +
labs(title = "Distribution of SelfRating for Employees Who Stayed vs. Left",
x = "Self Rating",
y = "Count",
fill = "Attrition Status") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot.
manager_rating_dist <- hr_perf_dta %>%
group_by(bi_attrition, manager_rating) %>%
summarise(count = n(), .groups = 'drop')
#bar plot
ggplot(manager_rating_dist, aes(x = factor(manager_rating), y = count, fill = factor(bi_attrition))) +
geom_bar(stat = "identity", position = "dodge") + # Use position = "dodge" for side-by-side bars
scale_fill_manual(values = c("blue", "red"), labels = c("Stayed", "Left")) +
labs(title = "Distribution of ManagerRating for Employees Who Stayed vs. Left",
x = "Manager Rating",
y = "Count",
fill = "Attrition Status") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition.
# job_satisfaction is converted to a factor so that each satisfaction level gets its own pair of boxes
ggplot(hr_perf_dta, aes(x = factor(job_satisfaction), y = salary, fill = factor(bi_attrition))) +
geom_boxplot(alpha = 0.7, position = position_dodge(width = 0.8)) + # Boxplot with slight transparency
scale_fill_manual(values = c("skyblue", "violet"), labels = c("Stayed", "Left")) +
labs(title = "Salary by Job Satisfaction and Attrition Status",
x = "Job Satisfaction Level",
y = "Salary",
fill = "Attrition Status") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Average manager ratings are almost identical for employees who stayed and those who left (3.48 vs. 3.46), and average self-ratings are equally close (3.98 vs. 3.99). Because there are about twice as many “stayed” records as “left” records (4,638 vs. 2,261), raw counts in the bar plots will tend to be higher for the stayed group at most rating levels, and this should not be read as a difference in how the two groups were rated. Employees therefore appear to hold similar perceptions of their own performance regardless of retention status, and manager ratings do not separate the groups either, which agrees with the non-significant manager_rating coefficient in the logistic regression (p = 0.904). The salary distribution across job satisfaction levels indicates that higher compensation doesn’t necessarily guarantee higher job satisfaction, as evidenced by the presence of both high and low salaries across all satisfaction levels. Taken together, the analysis suggests that performance ratings alone are weak indicators of attrition risk in this dataset. HR interventions are better targeted using salary and tenure, with ratings and satisfaction scores used as supporting context.
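When the two groups differ in size, comparing within-group proportions is usually more informative than comparing raw counts. The sketch below uses hypothetical counts (toy_counts is invented for illustration) in which the stayed group is twice as large as the left group but both groups have the same rating mix.

```r
library(dplyr)

# Hypothetical counts: the stayed group (0) is twice the size of the left group (1)
toy_counts <- data.frame(
  bi_attrition   = c(0, 0, 1, 1),
  manager_rating = c(3, 4, 3, 4),
  count          = c(200, 300, 100, 150)
)

# Share of each rating within its own attrition group
toy_props <- toy_counts %>%
  group_by(bi_attrition) %>%
  mutate(prop = count / sum(count)) %>%
  ungroup()

toy_props  # counts differ between groups, proportions do not
```

In a ggplot bar chart, mapping y to prop instead of count (or using position = "fill") shows the same comparison visually.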
5.6 Work-life balance and retention strategies
At this point, you are already well aware of the dataset and the possible factors that contribute to employee attrition. Using your R skills, accomplish the following tasks:
Analyze the distribution of WorkLifeBalance ratings for employees who left versus those who stayed.
Use visualizations to show the differences.
Assess whether employees with poor work-life balance are more likely to leave.
You are free to choose how you accomplish this task. Be creative and provide insights that will help HR develop effective retention strategies.
# Analyze the distribution of WorkLifeBalance ratings for employees who left versus those who stayed
work_life_balance_dist <- hr_perf_dta %>%
group_by(bi_attrition, work_life_balance) %>%
summarise(count = n(), .groups = 'drop')
# Create a bar plot
ggplot(work_life_balance_dist, aes(x = factor(work_life_balance), y = count, fill = factor(bi_attrition))) +
geom_bar(stat = "identity", position = "dodge") + # Use position = "dodge" for side-by-side bars
scale_fill_manual(values = c("purple", "orange"), labels = c("Stayed", "Left")) +
labs(title = "Distribution of WorkLifeBalance Ratings for Employees Who Stayed vs. Left",
x = "Work-Life Balance Rating",
y = "Count",
fill = "Attrition Status") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Calculate attrition rates by work-life balance
attrition_rate_wlb <- hr_perf_dta %>%
group_by(work_life_balance) %>%
summarise(
total_employees = n(),
total_attrition = sum(bi_attrition == 1),
attrition_rate = (total_attrition / total_employees) * 100
)
# Print the attrition rate summary
print(attrition_rate_wlb)
# A tibble: 6 × 4
  work_life_balance total_employees total_attrition attrition_rate
              <int>           <int>           <int>          <dbl>
1                 1             121              37           30.6
2                 2            1702             568           33.4
3                 3            1670             580           34.7
4                 4            1706             560           32.8
5                 5            1510             516           34.2
6                NA             190               0            0
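The attrition rates in this summary differ by only a few percentage points, so it may help to check formally whether attrition depends on the work-life balance rating. The following is a minimal sketch, not part of the original analysis: it rebuilds a contingency table from the counts printed above (the NA row is left out) and runs a chi-square test of independence with base R's `chisq.test()`. The object names `wlb_counts`, `wlb_table`, and `wlb_test` are introduced here for illustration. The test assumes each row of the data is an independent observation, which would not hold if the same employee appears more than once.

```r
# Counts copied from the attrition_rate_wlb output above (NA row excluded).
wlb_counts <- data.frame(
  work_life_balance = 1:5,
  total_employees   = c(121, 1702, 1670, 1706, 1510),
  total_attrition   = c(37, 568, 580, 560, 516)
)

# Build a 5 x 2 contingency table: rows are ratings, columns are Stayed / Left.
wlb_table <- cbind(
  Stayed = wlb_counts$total_employees - wlb_counts$total_attrition,
  Left   = wlb_counts$total_attrition
)
rownames(wlb_table) <- wlb_counts$work_life_balance

# Chi-square test of independence between rating and attrition status.
wlb_test <- chisq.test(wlb_table)
print(wlb_test)

# Spread between the highest and lowest attrition rate, in percentage points.
rates <- 100 * wlb_table[, "Left"] / rowSums(wlb_table)
print(round(max(rates) - min(rates), 1))
```

A large p-value here would be consistent with the reading that a poor work-life balance rating, taken alone, does not separate leavers from stayers in this dataset.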
# Bar plot of attrition rates by work-life balance
ggplot(attrition_rate_wlb, aes(x = factor(work_life_balance), y = attrition_rate)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(title = "Attrition Rate by Work-Life Balance Rating",
x = "Work-Life Balance Rating",
y = "Attrition Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
5.7 Recommendations for HR interventions
Based on the analysis conducted, provide recommendations for HR interventions that could help reduce employee attrition and improve overall employee satisfaction and performance. You may use the following questions as a guide for your recommendations and discussion.
What are the key factors contributing to employee attrition in the company?
- Based on the statistical analysis, salary (p < 2e-16) and years at company are the primary drivers of employee attrition; both are negatively associated with it. Surprisingly, manager ratings and work-life balance showed minimal impact on attrition decisions (attrition rates vary only between 30.6% and 34.7% across work-life balance ratings 1 to 5), suggesting that monetary compensation and tenure matter more than previously thought.
Which factors are most strongly correlated with attrition?
- Years at company has the strongest negative correlation with attrition (-0.7), indicating that employees are most likely to leave during their early years. Salary shows a weaker but still significant negative correlation (-0.2): lower compensation is associated with higher attrition risk.
What strategies could be implemented to improve employee retention and satisfaction?
- HR should implement competitive salary structures with regular market adjustments and create comprehensive early-career support programs, including mentorship and clear career progression paths. Long-term incentive plans and recognition programs can also help retain employees during the early years, when attrition risk is highest.
How can HR leverage the insights from the analysis to develop effective retention strategies?
- HR should prioritize resources on the first few years of employment, where attrition risk is highest, with targeted interventions such as enhanced onboarding, regular check-ins, and competitive compensation packages. It should also set up data-driven monitoring to track the effectiveness of these interventions and adjust strategies based on ongoing analysis of retention metrics.
What are the potential benefits of implementing these strategies for the company?
- These targeted retention strategies can yield significant cost savings through reduced recruitment and training expenses, while preserving organizational knowledge and improving team stability. The company can also expect higher employee engagement, a stronger company culture, and better productivity from improved team continuity and less turnover-related disruption.
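The correlations and the p-value cited in the answers above could be obtained along the following lines. This is only a sketch under stated assumptions: the source does not show which test produced these figures, the data below are simulated rather than taken from hr_perf_dta, and the column name `years_at_company` is assumed (`salary` and `bi_attrition` follow the names used earlier in this report). It computes Pearson correlations with the 0/1 attrition indicator and fits a logistic regression with `glm()`.

```r
# Simulated stand-in data for illustration; NOT the project's dataset.
set.seed(147)
n <- 500
sim <- data.frame(
  salary           = round(runif(n, 30000, 150000)),
  years_at_company = sample(0:10, n, replace = TRUE)  # assumed column name
)

# Assumed relationship, for illustration only:
# the odds of leaving fall as tenure and salary rise.
log_odds <- 1.5 - 0.5 * sim$years_at_company - 0.00001 * sim$salary
sim$bi_attrition <- rbinom(n, size = 1, prob = plogis(log_odds))

# Pearson correlation with a 0/1 variable (the point-biserial correlation).
cor_years  <- cor(sim$bi_attrition, sim$years_at_company)
cor_salary <- cor(sim$bi_attrition, sim$salary)
print(round(c(years_at_company = cor_years, salary = cor_salary), 2))

# Logistic regression; coefficient p-values are in the "Pr(>|z|)" column.
fit <- glm(bi_attrition ~ salary + years_at_company,
           data = sim, family = binomial)
print(coef(summary(fit)))
```

On the real data, replacing `sim` with hr_perf_dta (and the assumed column name with the actual one) would show whether the reported values of -0.7 and -0.2 come from simple correlations like these or from a different method.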